integration for caching embeddings #27

GabrieleGhisleni · 2024-05-13T13:45:47Z

Hi all, following our recent discussion in Issue #19334, I have opened this pull request specifically to add the caching backend for embeddings.

maxjakob · 2024-05-14T08:58:36Z

@GabrieleGhisleni to fix the Python 3.8 issues we need to add `from typing import Dict, List` and use these two classes instead of `dict` and `list` in type annotations.
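For illustration, the suggested change looks roughly like the sketch below. The function is hypothetical and not taken from this PR; it only shows that on Python 3.8 the built-in `dict` and `list` cannot be subscripted in annotations, so the `typing` generics are used instead.

```python
from typing import Dict, List


def embeddings_by_text(
    texts: List[str], vectors: List[List[float]]
) -> Dict[str, List[float]]:
    """Pair each text with its vector (hypothetical helper, for illustration only)."""
    # Writing `list[str]` or `dict[str, list[float]]` here would raise a
    # TypeError at definition time on Python 3.8.
    return dict(zip(texts, vectors))
```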

maxjakob · 2024-05-14T12:05:13Z

@GabrieleGhisleni Thank you for contributing! Let me know when this is ready for review, happy to take a look.

GabrieleGhisleni · 2024-05-15T06:39:55Z

@maxjakob Thank you for your willingness to review! The changes are not quite ready yet, but I'll let you know as soon as they are. I appreciate your patience!

GabrieleGhisleni · 2024-05-22T14:28:49Z

@maxjakob The pull request should be ready to be reviewed.

maxjakob

Nice job so far! I'm mainly wondering if there is a hard requirement to store the vectors as lists of floats or if we can store them as bytes and integrate a little nicer with how other caches work?
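To make the trade-off concrete, here is a minimal sketch of the bytes approach. It assumes JSON as the encoding, which is an assumption for illustration only; the encoding actually used in this PR is not shown in the thread, and both helper names are hypothetical.

```python
import json
from typing import List


def vector_to_bytes(vector: List[float]) -> bytes:
    """Serialize a vector to UTF-8 encoded JSON bytes (assumed encoding)."""
    return json.dumps(vector).encode("utf-8")


def bytes_to_vector(raw: bytes) -> List[float]:
    """Inverse of vector_to_bytes: decode the JSON bytes back into floats."""
    return json.loads(raw.decode("utf-8"))
```

With values stored as opaque bytes, the store no longer needs to know that it holds embeddings, which is what would let it fit interfaces that expect a byte store.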

libs/elasticsearch/README.md

maxjakob · 2024-05-23T11:15:28Z

libs/elasticsearch/README.md

+
+Caching embeddings is obtained by using the [CacheBackedEmbeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings),
+in a slightly different way than the official documentation.


Can you add something on how (and maybe why) it is different?

"In a different way" because the documentation only ever mentions the `from_bytes_store` method, whereas we show that the store can be passed directly as an argument when instantiating `CacheBackedEmbeddings`.

libs/elasticsearch/langchain_elasticsearch/cache.py

libs/elasticsearch/pyproject.toml

maxjakob · 2024-05-23T11:48:10Z

libs/elasticsearch/langchain_elasticsearch/cache.py

+        if self._metadata is not None:
+            body["metadata"] = self._metadata
+        if self._store_input:
+            body["text_input"] = text_input


I assume you want to store these so you can inspect the cache and learn something about usage statistics? I'm wondering whether this is the right way to do it (or whether stats should be collected differently), but I'm not opposed to it.

The core idea behind storing both the text and embeddings together is to ensure that we have the possibility to reconstruct the model or utilize them in other ways in the future. It is just to preserve valuable information for potential future use cases.
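As a rough sketch of the behaviour under discussion, the snippet below builds a cache document in which the metadata and the original text are optional extras stored next to the vector. Only the `metadata` and `text_input` keys come from the diff above; the function name, its parameters, and the `vector_dump` key are hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_cache_document(
    vector: List[float],
    text_input: str,
    metadata: Optional[Dict[str, Any]] = None,
    store_input: bool = True,
) -> Dict[str, Any]:
    """Assemble the document body for one cached embedding (illustrative only)."""
    body: Dict[str, Any] = {"vector_dump": vector}  # hypothetical key name
    if metadata is not None:
        body["metadata"] = metadata
    if store_input:
        # Keeping the source text allows the cache to be inspected or reused later.
        body["text_input"] = text_input
    return body
```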

libs/elasticsearch/langchain_elasticsearch/cache.py

maxjakob · 2024-05-29T14:40:24Z

@GabrieleGhisleni no rush on this at all from my side, but feel free to tag me once the merge conflicts are resolved and this is ready for review again :)

GabrieleGhisleni · 2024-05-30T07:28:33Z

@maxjakob Great! The conflicts should be fixed.

maxjakob

Great improvements! I left some more comments for cosmetic changes.

libs/elasticsearch/README.md

libs/elasticsearch/tests/conftest.py

libs/elasticsearch/tests/unit_tests/test_imports.py

libs/elasticsearch/langchain_elasticsearch/__init__.py

libs/elasticsearch/langchain_elasticsearch/_utilities.py

libs/elasticsearch/langchain_elasticsearch/cache.py

GabrieleGhisleni · 2024-05-30T15:25:24Z

@maxjakob I should have fixed the last suggestions. Let me know if it looks OK to you.

maxjakob

Awesome work @GabrieleGhisleni! Thank you for your contribution.

maxjakob · 2024-05-30T15:29:04Z

I will release this tomorrow.

…22612) The package for LangChain integrations with Elasticsearch https://github.com/langchain-ai/langchain-elastic contains an Elasticsearch byte store cache integration (see langchain-ai/langchain-elastic#27). This is the documentation contribution on the page dedicated to stores integrations. Co-authored-by: Gabriele Ghisleni <[email protected]>

Gabriele Ghisleni added 3 commits May 13, 2024 15:43

add cache embeddings

2de4eff

fix typo

86cbac0

fix typo

ed30b0a

GabrieleGhisleni changed the title ~~add cache embeddings~~ cache embeddings May 13, 2024

GabrieleGhisleni changed the title ~~cache embeddings~~ integration for caching embeddings May 13, 2024

Gabriele Ghisleni added 4 commits May 14, 2024 08:58

run lint and fixes

0bc4ace

run lint and fixes

8c03dec

add integration test

c7e9046

add integration test

dffb6fc

add type annotation

7889fdb

Gabriele Ghisleni added 8 commits May 16, 2024 14:25

add type annotation

ce0ecc4

add integration tests and collapsce function

865ed7e

add integration tests and collapsce function

d8cf6ee

add integration tests and collapsce function

96e308f

add integration tests and collapsce function

591239c

fix docstring

d2d9762

minor refactoring

7edee53

minor refactoring

f3876a4

GabrieleGhisleni marked this pull request as ready for review May 22, 2024 14:27

maxjakob self-requested a review May 23, 2024 10:43

maxjakob requested changes May 23, 2024


Gabriele Ghisleni added 2 commits May 24, 2024 09:42

as bytestore

531046e

as bytestore

bfd9aab

Gabriele Ghisleni added 2 commits May 30, 2024 08:26

fix toml

b7a5d7d

fix toml

97beb53

Gabriele Ghisleni added 3 commits May 30, 2024 09:21

Merge branch 'main' into cache-embeddings

77f0657

fix toml

0af691e

fix toml

8587b44

maxjakob reviewed May 30, 2024


Gabriele Ghisleni added 3 commits May 30, 2024 15:31

removed setup connection

1dc2468

removed setup connection

c96cd83

removed setup connection

2a3256c

maxjakob approved these changes May 30, 2024


maxjakob merged commit aa9c150 into langchain-ai:main May 30, 2024
11 checks passed

GabrieleGhisleni mentioned this pull request Jun 6, 2024

docs: ElasticsearchCacheStore in stores integrations documentation langchain-ai/langchain#22612

Merged
